##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200    Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900    1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200    Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278    Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400    3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800    Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
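The summary above reports a minimum, the three quartiles, the mean and the maximum for each column. As a minimal sketch of how such a summary can be computed, the Python snippet below does it for a single column using only the standard library. The function name `six_number_summary` is hypothetical, and the example assumes that the X column is simply a row index running from 1 to 1599. The `method="inclusive"` option interpolates linearly between order statistics, treating the minimum and maximum as the 0th and 100th percentiles.

```python
import statistics


def six_number_summary(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    # "inclusive" treats min and max as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# Assumption: X is a plain row index from 1 to 1599.
x_column = list(range(1, 1600))
x_summary = six_number_summary(x_column)
print(x_summary)
```

Under that assumption the quartiles come out as 400.5, 800 and 1199.5 with a mean of 800, which agrees with the X column of the summary.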
Volatile acidity, pH and fixed acidity appear to be approximately normally distributed, with few outliers. Residual sugar, free sulfur dioxide and total sulfur dioxide seem to be long-tailed, and alcohol seems to have a bimodal distribution. Qualitatively, residual sugar and sulfur dioxide have extreme outliers.
The histogram of fixed acidity roughly follows a normal distribution, with a high concentration of wines close to the median of 7.90. There are also some outliers that push the mean up to 8.32, above the median (the 3rd quartile is 9.20).
The volatile acidity distribution seems to be bimodal, with peaks near 0.39 (1st quartile) and 0.64 (3rd quartile) and some outliers in the higher ranges.
Residual sugar shows a high concentration of wines around 2.2, the median, and some outliers with a maximum of 15.50.
The free sulfur dioxide distribution is long-tailed, with a few outliers over 60, a median of 14.00 and many samples around 7.0 (1st quartile).
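To make the word "outlier" less qualitative, one common convention (not necessarily the one behind the plots in this analysis) is Tukey's rule: flag values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to quartiles copied from the summary output; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds; values outside them are flagged."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (1st quartile, 3rd quartile, maximum) copied from the summary output.
columns = {
    "residual.sugar": (1.9, 2.6, 15.5),
    "free.sulfur.dioxide": (7.0, 21.0, 72.0),
    "total.sulfur.dioxide": (22.0, 62.0, 289.0),
}

for name, (q1, q3, maximum) in columns.items():
    lower, upper = tukey_fences(q1, q3)
    print(f"{name}: upper fence {upper:.2f}, max {maximum:.2f}, "
          f"max beyond fence: {maximum > upper}")
```

Under this rule the upper fences are about 3.65, 42 and 122, so the maxima of all three columns fall well outside them, which is consistent with the long tails described above.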
This data set consists of thirteen variables with 1,599 observations. Besides the index column X, the variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality. The dataset does not have ordered factor variables.
Some interesting observations: the median quality is 6 (the mean is 5.636); the maximum pH, which describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic), is 4.01; and most wines are between 3 and 4 on the pH scale [1].
We can also notice that about 75% of wines have less than 0.1 chlorides, the amount of salt in the wine.
The main feature in the data set is the quality and how the ingredients combine to make a good quality wine.
To assess wine quality, I think I need to consider the relationship between three components: sweetness, acidity and alcohol. A good quality wine is a balanced one.
Considering this information, some variables that we should pay attention to are pH, alcohol and residual sugar.
Yes, I created a factor variable to classify quality into three categories: regular, good and excellent.
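A minimal sketch of such a categorical variable follows. The cut points used here (scores below 5 as regular, 5 and 6 as good, 7 and above as excellent) are an assumption made purely for illustration, since the text does not state the thresholds that were actually used. `quality_category` is a hypothetical name.

```python
def quality_category(score, good_from=5, excellent_from=7):
    """Map a numeric quality score to 'regular', 'good' or 'excellent'.

    The default thresholds are illustrative assumptions, not the
    cut points used in the original analysis.
    """
    if score >= excellent_from:
        return "excellent"
    if score >= good_from:
        return "good"
    return "regular"


# Quality scores in the summary range from 3 to 8.
for score in range(3, 9):
    print(score, quality_category(score))
```

Keeping the thresholds as parameters makes it easy to check how sensitive later conclusions are to where the class boundaries are drawn.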
An analysis of the relation between density and alcohol should be important, as density is clearly negatively correlated with the alcohol variable (-0.49617977).
The density has an alcohol concentration peak around 10.0 and seems to have an interesting relationship with wine quality.
Plotting alcohol against quality, we can see that better wines have a higher alcohol concentration. Comparing the previous plot with this one, it seems that a lower density goes with a better wine.
Now I will use boxplots to further explore the relationship between quality and some other variables, and to find what drives it up.
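What a boxplot per quality level summarises can also be sketched numerically: group one variable by quality and compare the group medians. The rows below are made-up values for illustration only, not taken from the wine data, and `median_by_group` is a hypothetical helper.

```python
from collections import defaultdict
import statistics


def median_by_group(rows, group_key, value_key):
    """Return {group: median of value_key}, with groups in sorted order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: statistics.median(v) for g, v in sorted(groups.items())}


# Made-up rows, not taken from the wine data.
rows = [
    {"quality": 5, "alcohol": 9.4},
    {"quality": 5, "alcohol": 9.8},
    {"quality": 6, "alcohol": 10.5},
    {"quality": 6, "alcohol": 10.9},
    {"quality": 7, "alcohol": 11.6},
    {"quality": 7, "alcohol": 12.0},
    {"quality": 7, "alcohol": 11.2},
]

medians = median_by_group(rows, "quality", "alcohol")
for quality, median in medians.items():
    print(quality, round(median, 2))
```

A median that moves steadily in one direction across quality levels is the numerical counterpart of boxes that shift up or down from one quality level to the next.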
As the correlation table showed, fixed.acidity seems to have little effect on wine quality compared to the other elements.
volatile.acidity seems to be a bad feature in wines: quality seems to go up when volatile.acidity goes down.
From this plot it seems that better wines tend to have a higher concentration of citric acid; the overall median is 0.26.
Contrary to what I was expecting, residual sugar seems to have no effect on wine quality. Maybe residual sugar is a matter of personal taste rather than of consensus on quality.
Even though the correlation is weak, a lower concentration of chlorides seems to go with better wines.
Better wines tend to have lower densities, but this is probably due to the alcohol concentration, as shown before.
Correlation refers to a technique used to measure the relationship between two or more variables. When two things are correlated, it means that they vary together.
A positive correlation means that high scores on one are associated with high scores on the other, and that low scores on one are associated with low scores on the other.
On the other hand, negative correlation means that high scores on the first thing are associated with low scores on the second. It also means that low scores on the first are associated with high scores on the second. [2]
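The coefficients quoted in this analysis look like Pearson correlation coefficients, which is what is assumed here. As a sketch, the coefficient can be computed as the sum of products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The name `pearson_r` and the sample lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    # Note: a constant sequence has zero variance and would divide by zero.
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Made-up values: y rises with x, z falls as x rises.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 8, 6, 4, 2]

print(pearson_r(x, y))  # positive: high x goes with high y
print(pearson_r(x, z))  # negative: high x goes with low z
```

The result always lies between -1 and 1; values near 0 indicate little linear relationship, which is how the weaker coefficients in the correlation table should be read.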
As quality is our main feature of interest, I should analyse what correlates most with it. According to the correlation table, quality is most strongly correlated with the following:
This shows a different correlation from what I expected after analysing the individual variables and the literature.
[2] http://statisticalconcepts.blogspot.com.br/2010/04/interpretation-of-correlation.html
As expected, density and alcohol have a negative correlation. Given that alcohol is strongly related to the quality of the wine, a good wine should have a low density.
I also noticed that pH and density are the two variables most strongly related to fixed acidity, and that those correlations are among the largest in the dataset.
pH and fixed acidity have a correlation of -0.68297819, which is expected as pH measures acidity.
This first plot shows the relation between citric acid, volatile acidity and quality. As we can see, we have some wines with 0 volatile acidity and many with low citric acidity.
Among the excellent wines we see what may be one outlier, with volatile acidity close to 0.8 and almost 0 citric acid. Considering that the mean citric acid is 0.271 and the first quartile is 0.090, it may be an outlier, but given the lack of data about excellent wines I cannot be sure.
Again we can see some wines with very similar characteristics that are classified differently, for example as good instead of excellent. Maybe the excellent ones are not well classified because we have only a small dataset.
Comparing those two variables, we notice a great dispersion in the plots for excellent wines. Maybe those two variables together are not very helpful for classifying quality.
Again we have the same dispersion as shown above.
Now we have more homogeneous information about quality and can say that excellent wines tend to have sulphates in the range [-0.25, 0.00] and citric acid in the range [0.00, 0.75].
Another plot with well defined ranges.
Finally, another very dispersed plot, where the lack of data may hide any conclusions.
I primarily examined the features that showed the strongest correlation with quality. Then I plotted combinations of the variables to see how they relate to wine quality.
It became clear that higher citric acid and lower volatile acidity contribute towards better wines.
Another point is that better wines tend to have higher sulphates and alcohol content. The boundary between excellent wines and good or regular ones is very clear, as sulphates drop below -0.25.
One interesting point is that many regular and good wines have citric acid levels equal to 0, while none of the excellent ones do. A little research shows that citric acid, if added to an almost finished wine to increase acidity, gives the wine a freshness of flavor [3].
This plot shows the relationship between alcohol concentration and wine quality, and how alcohol affects the quality of wines. But we need to keep in mind that wine quality never comes down to a single factor. Color, structure, flavor and typicity are all important. That is why wines are admired for their harmony and complexity, whether their alcohol level is low or high.
A citric acid level between 0.25 and 0.5 combined with a volatile acidity below 0.4 seems to produce better wines. Acids are one of four fundamental traits in wine (the others are tannin, alcohol and sweetness). Acidity gives wine its tart and sour taste. Fundamentally speaking, all wines lie on the acidic side of the pH spectrum and most range from 2.5 to about 4.5 pH (7 is neutral) [4].
[4] http://winefolly.com/review/understanding-acidity-in-wine/
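The citric acid and volatile acidity ranges mentioned above can be turned into a simple filter to check how many wines of each quality class fall inside them. This is only a sketch: the rows are made up, the class labels assume the regular, good and excellent factor created earlier in the analysis, and `in_sweet_spot` is a hypothetical name.

```python
from collections import Counter


def in_sweet_spot(wine):
    """True if citric acid is in 0.25-0.5 and volatile acidity is below 0.4."""
    return 0.25 <= wine["citric_acid"] <= 0.5 and wine["volatile_acidity"] < 0.4


# Made-up rows, not taken from the wine data.
wines = [
    {"citric_acid": 0.00, "volatile_acidity": 0.70, "rating": "regular"},
    {"citric_acid": 0.30, "volatile_acidity": 0.35, "rating": "excellent"},
    {"citric_acid": 0.45, "volatile_acidity": 0.38, "rating": "good"},
    {"citric_acid": 0.10, "volatile_acidity": 0.55, "rating": "good"},
    {"citric_acid": 0.40, "volatile_acidity": 0.30, "rating": "excellent"},
]

matches = Counter(w["rating"] for w in wines if in_sweet_spot(w))
totals = Counter(w["rating"] for w in wines)
for rating, total in totals.items():
    print(f"{rating}: {matches[rating]} of {total} inside the ranges")
```

Comparing the share of each class inside the ranges, rather than the raw counts, avoids being misled by the small number of excellent wines.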
The last plot should answer the most wanted question: what makes the best wine? Comparing the most correlated variables, we can see that an excellent wine has about 12% alcohol and at most 0.4 g/dm³ volatile acidity.
This was an interesting walk into the red wine world to study what influences wine quality. It is a good starting point, but it lacks data, especially about excellent wines: as the histogram shows, most of the wines are regular or good, with just a few outliers pushing towards excellence.
It is worth noting that almost every plot showed either a normal or a long-tailed distribution.
After exploring the individual variables, I proceeded to investigate their correlations. After some plotting I could confirm the assumption that alcohol has a great influence on wine quality, and also that other chemical components, such as citric acid, can make an interesting contribution to quality.
I also tried investigating the effect of some elements on the overall wine quality. I chose boxplots to explore the relationships graphically because of their simplicity for evaluating the distribution of the data.
In the final part of the analysis I tried using multivariate plots to investigate whether there were interesting combinations of variables that might affect quality. One interesting point was that many regular and good wines have citric acid levels equal to 0, while none of the excellent ones do. Doing a little research, I discovered that this acid, if added to an almost finished wine, increases the acidity, giving the wine a freshness of flavor.